
    A Bayesian Model for Cluster Detection

    The detection of areas in which the risk of a particular disease is significantly elevated, leading to an excess of cases, is an important enterprise in spatial epidemiology. Various frequentist approaches have been suggested for the detection of “clusters” within a hypothesis testing framework. Unfortunately, these suffer from a number of drawbacks, including the difficulty of specifying a p-value threshold at which to call significance, the inherent multiplicity problem, and the possibility of multiple clusters. In this paper, we suggest a Bayesian approach to detecting “areas of clustering” in which the study region is partitioned into (possibly multiple) “zones” within which the risk is either at a null or a non-null level. Computation is carried out using Markov chain Monte Carlo, tuned to the model that we develop. The method is applied to leukemia data in upstate New York.

    A Bayesian Method for Cluster Detection with Application to Five Cancer Sites in Puget Sound

    Cluster detection is an important public health endeavor, and in this paper we describe and apply a recently developed Bayesian method. Commonly used approaches are based on so-called scan statistics and suffer from a number of difficulties, including how to choose a level of significance and how to deal with the possibility of multiple clusters. The basis of our model is to partition the study region into a set of areas which are either “null” or “non-null”, the latter corresponding to clusters (excess risk) or anti-clusters (reduced risk). We demonstrate the Bayesian method and compare it with a popular existing approach, using data on breast, brain, lung, prostate and colorectal cancer in the Puget Sound region of Washington State. We address the important issues of sensitivity to the priors and the incorporation of covariates. The approach is implemented within the freely available R package SpatialEpi.

    A Permutation Test and Spatial Cross-Validation Approach to Assess Models of Interspecific Competition Between Trees

    Measuring species-specific competitive interactions is key to understanding plant communities. Repeat-censused large forest dynamics plots offer an ideal setting to measure these interactions by estimating the species-specific competitive effect on neighboring tree growth. Estimating these interaction values can be difficult, however, because the number of them grows with the square of the number of species. Furthermore, confidence in the estimates can be overstated if any spatial structure of model errors is not considered. Here we measured these interactions in a forest dynamics plot in a transitional oak-hickory forest. We analytically fit Bayesian linear regression models of annual tree radial growth as a function of that tree’s species, its size, and its neighboring trees. We then compared these models to test whether the identity of a tree’s neighbors matters and, if so, at what level: based on trait grouping, based on phylogenetic family, or based on species. We used a spatial cross-validation scheme to better estimate model errors while avoiding potentially over-fitting our models. Since our model is analytically solvable, we can rapidly evaluate it, which allows our proposed cross-validation scheme to be computationally feasible. We found that the identity of the focal and competitor trees mattered for competitive interactions, but surprisingly, identity mattered at the family rather than the species level.

    OkCupid Data for Introductory Statistics and Data Science Courses

    We present a data set consisting of user profile data for 59,946 San Francisco users of OkCupid (a free online dating website) from June 2012. The data set includes typical user information, lifestyle variables, and text responses to 10 essay questions. We present four example analyses suitable for use in undergraduate introductory probability and statistics and data science courses that use R. The statistical and data science concepts covered include basic data visualization, exploratory data analysis, multivariate relationships, text analysis, and logistic regression for prediction.

    The fivethirtyeight R package: ‘Tame Data’ Principles for Introductory Statistics and Data Science Courses

    As statistics and data science instructors, we often seek to use data in our courses that are rich, real, realistic, and relevant. To this end we created the fivethirtyeight R package of data and code behind the stories and interactives at the data journalism website FiveThirtyEight.com. After a discussion on the conflicting pedagogical goals of minimizing prerequisites to research (Cobb 2015) while at the same time presenting students with a realistic view of data as it exists in the wild, we articulate how a desired balance between these two goals informed the design of the package. The details behind this balance are articulated as our proposed Tame data principles for introductory statistics and data science courses. Details of the package’s construction and example uses are included as well.

    The Forestecology R Package for Fitting and Assessing Neighborhood Models of the Effect of Interspecific Competition on the Growth of Trees

    Neighborhood competition models are powerful tools to measure the effect of interspecific competition. Statistical methods to ease the application of these models are currently lacking. We present the forestecology package providing methods to (a) specify neighborhood competition models, (b) evaluate the effect of competitor species identity using permutation tests, and (c) measure model performance using spatial cross-validation. Following Allen and Kim (PLoS One, 15, 2020, e0229930), we implement a Bayesian linear regression neighborhood competition model. We demonstrate the package’s functionality using data from the Smithsonian Conservation Biology Institute’s large forest dynamics plot, part of the ForestGEO global network of research sites. Given ForestGEO’s data collection protocols and data formatting standards, the package was designed with cross-site compatibility in mind. We highlight the importance of spatial cross-validation when interpreting model results. The package features (a) tidyverse-like structure whereby verb-named functions can be modularly “piped” in sequence, (b) functions with standardized inputs/outputs of simple features sf package class, and (c) an S3 object-oriented implementation of the Bayesian linear regression model. These three features allow for clear articulation of all the steps in the sequence of analysis and easy wrangling and visualization of the geospatial data. Furthermore, while the package only has Bayesian linear regression implemented, the package was designed with extensibility to other methods in mind.

    Using Labeled Data to Evaluate Change Detectors in a Multivariate Streaming Environment

    We consider the problem of detecting changes in a multivariate data stream. A change detector is defined by a detection algorithm and an alarm threshold. A detection algorithm maps the stream of input vectors into a univariate detection stream. The detector signals a change when the detection stream exceeds the chosen alarm threshold. We consider two aspects of the problem: (1) setting the alarm threshold and (2) measuring/comparing the performance of detection algorithms. We assume we are given a segment of the stream where changes of interest are marked. We present evidence that, without such marked training data, it might not be possible to accurately estimate the false alarm rate for a given alarm threshold. Commonly used approaches assume the data stream consists of independent observations, an implausible assumption given the time series nature of the data. Lack of independence can lead to estimates that are badly biased. Marked training data can also be used for realistic comparison of detection algorithms. We define a version of the receiver operating characteristic curve adapted to the change detection problem and propose a block bootstrap for comparing such curves. We illustrate the proposed methodology using multivariate data derived from an image stream.
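
    The two-part definition of a change detector in the abstract above (a detection algorithm plus an alarm threshold) can be illustrated with a minimal Python sketch. The detection algorithm used here, the Euclidean distance from each vector to the mean of a short trailing window, is a hypothetical stand-in chosen for illustration and is not the paper's method; the function names, the window length, the threshold, and the toy stream are likewise assumptions.

```python
import math


def detection_stream(vectors, window=5):
    """Map a stream of input vectors to a univariate detection stream.

    Hypothetical detection algorithm (an assumption, not from the paper):
    the score at time t is the Euclidean distance between the vector at t
    and the mean of the preceding `window` vectors.
    """
    scores = []
    for t, x in enumerate(vectors):
        past = vectors[max(0, t - window):t]
        if not past:
            # No history yet, so no evidence of change.
            scores.append(0.0)
            continue
        mean = [sum(col) / len(past) for col in zip(*past)]
        scores.append(math.dist(x, mean))
    return scores


def alarms(scores, threshold):
    """Return the time indices where the detection stream exceeds the threshold."""
    return [t for t, s in enumerate(scores) if s > threshold]


# Toy bivariate stream with a level shift at t = 6 (illustrative data only).
stream = [(0.0, 0.0)] * 6 + [(5.0, 5.0)] * 6
scores = detection_stream(stream)
alarm_times = alarms(scores, threshold=3.0)
print(alarm_times)
```

    In this sketch the threshold is picked by hand. The abstract's point is that, because successive scores are not independent, the false alarm rate implied by such a threshold is hard to estimate without a marked segment of the stream.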